# Copyright (C) 2013-2021 Yuchen Pei.

# Permission is granted to copy, distribute and/or modify this
# document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by the Free Software
# Foundation; with no Invariant Sections, no Front-Cover Texts, and
# no Back-Cover Texts. You should have received a copy of the GNU
# Free Documentation License. If not, see <https://www.gnu.org/licenses/>.

# This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

#+title: Shapley, LIME and SHAP

#+date: <2018-12-02>

In this post I explain LIME (Ribeiro et. al. 2016), the Shapley values
(Shapley, 1953) and the SHAP values (Strumbelj-Kononenko, 2014;
Lundberg-Lee, 2017).

*Acknowledgement*. Thanks to Josef Lindman Hörnlund for bringing the
LIME and SHAP papers to my attention. The research was done while
working at KTH mathematics department.

/If you are reading on a mobile device, you may need to "request desktop
site" for the equations to be properly displayed. This post is licensed
under CC BY-SA and GNU FDL./

** Shapley values
   :PROPERTIES:
   :CUSTOM_ID: shapley-values
   :ID:       1f4de00d-669e-421a-bd48-f4f1da8400f3
   :END:
A coalitional game $(v, N)$ of $n$ players involves

- The set $N = \{1, 2, ..., n\}$ that represents the players.
- A function $v: 2^N \to \mathbb R$, where $v(S)$ is the worth of
  coalition $S \subset N$.

The Shapley values $\phi_i(v)$ of such a game specify a fair way to
distribute the total worth $v(N)$ to the players. It is defined as (in
the following, for a set $S \subset N$ we use the convention $s = |S|$
to be the number of elements of set $S$ and the shorthand
$S - i := S \setminus \{i\}$ and $S + i := S \cup \{i\}$)

$$\phi_i(v) = \sum_{S: i \in S} {(n - s)! (s - 1)! \over n!} (v(S) - v(S - i)).$$

It is not hard to see that $\phi_i(v)$ can be viewed as an expectation:

$$\phi_i(v) = \mathbb E_{S \sim \nu_i} (v(S) - v(S - i))$$

where $\nu_i(S) = n^{-1} {n - 1 \choose s - 1}^{-1} 1_{i \in S}$, that
is, first pick the size $s$ uniformly from $\{1, 2, ..., n\}$, then pick
$S$ uniformly from the subsets of $N$ that has size $s$ and contains
$i$.

The Shapley values satisfy some nice properties which are readily
verified, including:

- *Efficiency*. $\sum_i \phi_i(v) = v(N) - v(\emptyset)$.
- *Symmetry*. If for some $i, j \in N$, for all $S \subset N$, we have
  $v(S + i) = v(S + j)$, then $\phi_i(v) = \phi_j(v)$.
- *Null player*. If for some $i \in N$, for all $S \subset N$, we have
  $v(S + i) = v(S)$, then $\phi_i(v) = 0$.
- *Linearity*. $\phi_i$ is linear in games. That is
  $\phi_i(v) + \phi_i(w) = \phi_i(v + w)$, where $v + w$ is defined by
  $(v + w)(S) := v(S) + w(S)$.

In the literature, an added assumption $v(\emptyset) = 0$ is often
given, in which case the Efficiency property is defined as
$\sum_i \phi_i(v) = v(N)$. Here I discard this assumption to avoid minor
inconsistencies across different sources. For example, in the LIME
paper, the local model is defined without an intercept, even though the
underlying $v(\emptyset)$ may not be $0$. In the SHAP paper, an
intercept $\phi_0 = v(\emptyset)$ is added which fixes this problem when
making connections to the Shapley values.

Conversely, according to Strumbelj-Kononenko (2010), it was shown in
Shapley's original paper (Shapley, 1953) that these four properties
together with $v(\emptyset) = 0$ defines the Shapley values.

** LIME
   :PROPERTIES:
   :CUSTOM_ID: lime
   :ID:       b34aabbc-2f50-4969-bf7b-87678fd577e6
   :END:
LIME (Ribeiro et. al. 2016) is a model that offers a way to explain
feature contributions of supervised learning models locally.

Let $f: X_1 \times X_2 \times ... \times X_n \to \mathbb R$ be a
function. We can think of $f$ as a model, where $X_j$ is the space of
$j$th feature. For example, in a language model, $X_j$ may correspond to
the count of the $j$th word in the vocabulary, i.e. the bag-of-words
model.

The output may be something like housing price, or log-probability of
something.

LIME tries to assign a value to each feature /locally/. By locally, we
mean that given a specific sample $x \in X := \prod_{i = 1}^n X_i$, we
want to fit a model around it.

More specifically, let $h_x: 2^N \to X$ be a function defined by

$$(h_x(S))_i = 
\begin{cases}
x_i, & \text{if }i \in S; \\
0, & \text{otherwise.}
\end{cases}$$

That is, $h_x(S)$ masks the features that are not in $S$, or in other
words, we are perturbing the sample $x$. Specifically, $h_x(N) = x$.
Alternatively, the $0$ in the "otherwise" case can be replaced by some
kind of default value (see the section titled SHAP in this post).

For a set $S \subset N$, let us denote $1_S \in \{0, 1\}^n$ to be an
$n$-bit where the $k$th bit is $1$ if and only if $k \in S$.

Basically, LIME samples $S_1, S_2, ..., S_m \subset N$ to obtain a set
of perturbed samples $x_i = h_x(S_i)$ in the $X$ space, and then fits a
linear model $g$ using $1_{S_i}$ as the input samples and $f(h_x(S_i))$
as the output samples:

*Problem*(LIME). Find $w = (w_1, w_2, ..., w_n)$ that minimises

$$\sum_i (w \cdot 1_{S_i} - f(h_x(S_i)))^2 \pi_x(h_x(S_i))$$

where $\pi_x(x')$ is a function that penalises $x'$s that are far away
from $x$. In the LIME paper the Gaussian kernel was used:

$$\pi_x(x') = \exp\left({- \|x - x'\|^2 \over \sigma^2}\right).$$

Then $w_i$ represents the importance of the $i$th feature.

The LIME model has a more general framework, but the specific model
considered in the paper is the one described above, with a Lasso for
feature selection.

*Remark*. One difference between our account here and the one in the
LIME paper is: the dimension of the data space may differ from $n$ (see
Section 3.1 of that paper). But in the case of text data, they do use
bag-of-words (our $X$) for an "intermediate" representation. So my
understanding is, in their context, there is an "original" data space
(let's call it $X'$). And there is a one-one correspondence between $X'$
and $X$ (let's call it $r: X' \to X$), so that given a sample
$x' \in X'$, we can compute the output of $S$ in the local model with
$f(r^{-1}(h_{r(x')}(S)))$. As an example, in the example of $X$ being
the bag of words, $X'$ may be the embedding vector space, so that
$r(x') = A^{-1} x'$, where $A$ is the word embedding matrix. Therefore,
without loss of generality, we assume the input space to be $X$ which is
of dimension $n$.

** Shapley values and LIME
   :PROPERTIES:
   :CUSTOM_ID: shapley-values-and-lime
   :ID:       bb960a6a-be79-44ff-968f-000990673738
   :END:
The connection between the Shapley values and LIME is noted in
Lundberg-Lee (2017), but the underlying connection goes back to 1988
(Charnes et. al.).

To see the connection, we need to modify LIME a bit.

First, we need to make LIME less efficient by considering /all/ the
$2^n$ subsets instead of the $m$ samples $S_1, S_2, ..., S_{m}$.

Then we need to relax the definition of $\pi_x$. It no longer needs to
penalise samples that are far away from $x$. In fact, we will see later
than the choice of $\pi_x(x')$ that yields the Shapley values is high
when $x'$ is very close or very far away from $x$, and low otherwise. We
further add the restriction that $\pi_x(h_x(S))$ only depends on the
size of $S$, thus we rewrite it as $q(s)$ instead.

We also denote $v(S) := f(h_x(S))$ and $w(S) = \sum_{i \in S} w_i$.

Finally, we add the Efficiency property as a constraint:
$\sum_{i = 1}^n w_i = f(x) - f(h_x(\emptyset)) = v(N) - v(\emptyset)$.

Then the problem becomes a weighted linear regression:

*Problem*. minimises $\sum_{S \subset N} (w(S) - v(S))^2 q(s)$ over $w$
subject to $w(N) = v(N) - v(\emptyset)$.

*Claim* (Charnes et. al. 1988). The solution to this problem is

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s)\right)^{-1} \sum_{S \subset N: i \in S} \left({n - s \over n} q(s) v(S) - {s - 1 \over n} q(s - 1) v(S - i)\right). \qquad (-1)$$

Specifically, if we choose

$$q(s) = c {n - 2 \choose s - 1}^{-1}$$

for any constant $c$, then $w_i = \phi_i(v)$ are the Shapley values.

*Remark*. Don't worry about this specific choice of $q(s)$ when $s = 0$
or $n$, because $q(0)$ and $q(n)$ do not appear on the right hand side
of (-1). Therefore they can be defined to be of any value. A common
convention of the binomial coefficients is to set ${\ell \choose k} = 0$
if $k < 0$ or $k > \ell$, in which case $q(0) = q(n) = \infty$.

In Lundberg-Lee (2017), $c$ is chosen to be $1 / n$, see Theorem 2
there.

In Charnes et. al. 1988, the $w_i$s defined in (-1) are called the
generalised Shapley values.

*Proof*. The Lagrangian is

$$L(w, \lambda) = \sum_{S \subset N} (v(S) - w(S))^2 q(s) - \lambda(w(N) - v(N) + v(\emptyset)).$$

and by making $\partial_{w_i} L(w, \lambda) = 0$ we have

$${1 \over 2} \lambda = \sum_{S \subset N: i \in S} (w(S) - v(S)) q(s). \qquad (0)$$

Summing (0) over $i$ and divide by $n$, we have

$${1 \over 2} \lambda = {1 \over n} \sum_i \sum_{S: i \in S} (w(S) q(s) - v(S) q(s)). \qquad (1)$$

We examine each of the two terms on the right hand side.

Counting the terms involving $w_i$ and $w_j$ for $j \neq i$, and using
$w(N) = v(N) - v(\emptyset)$ we have

$$\begin{aligned}
&\sum_{S \subset N: i \in S} w(S) q(s) \\
&= \sum_{s = 1}^n {n - 1 \choose s - 1} q(s) w_i + \sum_{j \neq i}\sum_{s = 2}^n {n - 2 \choose s - 2} q(s) w_j \\
&= q(1) w_i + \sum_{s = 2}^n q(s) \left({n - 1 \choose s - 1} w_i + \sum_{j \neq i} {n - 2 \choose s - 2} w_j\right) \\
&= q(1) w_i + \sum_{s = 2}^n \left({n - 2 \choose s - 1} w_i + {n - 2 \choose s - 2} (v(N) - v(\emptyset))\right) q(s) \\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) w_i + \sum_{s = 2}^n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset)). \qquad (2)
\end{aligned}$$

Summing (2) over $i$, we obtain

$$\begin{aligned}
&\sum_i \sum_{S: i \in S} w(S) q(s)\\
&= \sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) (v(N) - v(\emptyset)) + \sum_{s = 2}^n n {n - 2 \choose s - 2} q(s) (v(N) - v(\emptyset))\\
&= \sum_{s = 1}^n s{n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)). \qquad (3)
\end{aligned}$$

For the second term in (1), we have

$$\sum_i \sum_{S: i \in S} v(S) q(s) = \sum_{S \subset N} s v(S) q(s). \qquad (4)$$

Plugging (3)(4) in (1), we have

$${1 \over 2} \lambda = {1 \over n} \left(\sum_{s = 1}^n s {n - 1 \choose s - 1} q(s) (v(N) - v(\emptyset)) - \sum_{S \subset N} s q(s) v(S) \right). \qquad (5)$$

Plugging (5)(2) in (0) and solve for $w_i$, we have

$$w_i = {1 \over n} (v(N) - v(\emptyset)) + \left(\sum_{s = 1}^{n - 1} {n - 2 \choose s - 1} q(s) \right)^{-1} \left( \sum_{S: i \in S} q(s) v(S) - {1 \over n} \sum_{S \subset N} s q(s) v(S) \right). \qquad (6)$$

By splitting all subsets of $N$ into ones that contain $i$ and ones that
do not and pair them up, we have

$$\sum_{S \subset N} s q(s) v(S) = \sum_{S: i \in S} (s q(s) v(S) + (s - 1) q(s - 1) v(S - i)).$$

Plugging this back into (6) we get the desired result. $\square$

** SHAP
   :PROPERTIES:
   :CUSTOM_ID: shap
   :ID:       11af7fd2-a4f9-42da-a275-5d15cde0ca48
   :END:
The paper that coined the term "SHAP values" (Lundberg-Lee 2017) is not
clear in its definition of the "SHAP values" and its relation to LIME,
so the following is my interpretation of their interpretation model,
which coincide with a model studied in Strumbelj-Kononenko 2014.

Recall that we want to calculate feature contributions to a model $f$ at
a sample $x$.

Let $\mu$ be a probability density function over the input space
$X = X_1 \times ... \times X_n$. A natural choice would be the density
that generates the data, or one that approximates such density (e.g.
empirical distribution).

The feature contribution (SHAP value) is thus defined as the Shapley
value $\phi_i(v)$, where

$$v(S) = \mathbb E_{z \sim \mu} (f(z) | z_S = x_S). \qquad (7)$$

So it is a conditional expectation where $z_i$ is clamped for $i \in S$.
In fact, the definition of feature contributions in this form predates
Lundberg-Lee 2017. For example, it can be found in
Strumbelj-Kononenko 2014.

One simplification is to assume the $n$ features are independent, thus
$\mu = \mu_1 \times \mu_2 \times ... \times \mu_n$. In this case, (7)
becomes

$$v(S) = \mathbb E_{z_{N \setminus S} \sim \mu_{N \setminus S}} f(x_S, z_{N \setminus S}) \qquad (8)$$

For example, Strumbelj-Kononenko (2010) considers this scenario where
$\mu$ is the uniform distribution over $X$, see Definition 4 there.

A further simplification is model linearity, which means $f$ is linear.
In this case, (8) becomes

$$v(S) = f(x_S, \mathbb E_{\mu_{N \setminus S}} z_{N \setminus S}). \qquad (9)$$

It is worth noting that to make the modified LIME model considered in
the previous section fall under the linear SHAP framework (9), we need
to make two further specialisations, the first is rather cosmetic: we
need to change the definition of $h_x(S)$ to

$$(h_x(S))_i = 
\begin{cases}
x_i, & \text{if }i \in S; \\
\mathbb E_{\mu_i} z_i, & \text{otherwise.}
\end{cases}$$

But we also need to boldly assume the original $f$ to be linear, which
in my view, defeats the purpose of interpretability, because linear
models are interpretable by themselves.

One may argue that perhaps we do not need linearity to define $v(S)$ as
in (9). If we do so, however, then (9) loses mathematical meaning. A
bigger question is: how effective is SHAP? An even bigger question: in
general, how do we evaluate models of interpretation?

** Evaluating SHAP
   :PROPERTIES:
   :CUSTOM_ID: evaluating-shap
   :ID:       a7249300-814f-499e-b808-c66a17a4d32e
   :END:
The quest of the SHAP paper can be decoupled into two independent
components: showing the niceties of Shapley values and choosing the
coalitional game $v$.

The SHAP paper argues that Shapley values $\phi_i(v)$ are a good
measurement because they are the only values satisfying the some nice
properties including the Efficiency property mentioned at the beginning
of the post, invariance under permutation and monotonicity, see the
paragraph below Theorem 1 there, which refers to Theorem 2 of Young
(1985).

Indeed, both efficiency (the "additive feature attribution methods" in
the paper) and monotonicity are meaningful when considering $\phi_i(v)$
as the feature contribution of the $i$th feature.

The question is thus reduced to the second component: what constitutes a
nice choice of $v$?

The SHAP paper answers this question with 3 options with increasing
simplification: (7)(8)(9) in the previous section of this post
(corresponding to (9)(11)(12) in the paper). They are intuitive, but it
will be interesting to see more concrete (or even mathematical)
justifications of such choices.

** References
   :PROPERTIES:
   :CUSTOM_ID: references
   :ID:       7ba9546e-2485-43e4-be48-16762fc447b2
   :END:

- Charnes, A., B. Golany, M. Keane, and J. Rousseau. "Extremal Principle
  Solutions of Games in Characteristic Function Form: Core, Chebychev
  and Shapley Value Generalizations." In Econometrics of Planning and
  Efficiency, edited by Jati K. Sengupta and Gopal K. Kadekodi, 123--33.
  Dordrecht: Springer Netherlands, 1988.
  [[https://doi.org/10.1007/978-94-009-3677-5_7]].
- Lundberg, Scott, and Su-In Lee. "A Unified Approach to Interpreting
  Model Predictions." ArXiv:1705.07874 [Cs, Stat], May 22, 2017.
  [[http://arxiv.org/abs/1705.07874]].
- Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should
  I Trust You?': Explaining the Predictions of Any Classifier."
  ArXiv:1602.04938 [Cs, Stat], February 16, 2016.
  [[http://arxiv.org/abs/1602.04938]].
- Shapley, L. S. "17. A Value for n-Person Games." In Contributions to
  the Theory of Games (AM-28), Volume II, Vol. 2. Princeton: Princeton
  University Press, 1953. [[https://doi.org/10.1515/9781400881970-018]].
- Strumbelj, Erik, and Igor Kononenko. "An Efficient Explanation of
  Individual Classifications Using Game Theory." J. Mach. Learn. Res. 11
  (March 2010): 1--18.
- Strumbelj, Erik, and Igor Kononenko. "Explaining Prediction Models and
  Individual Predictions with Feature Contributions." Knowledge and
  Information Systems 41, no. 3 (December 2014): 647--65.
  [[https://doi.org/10.1007/s10115-013-0679-x]].
- Young, H. P. "Monotonic Solutions of Cooperative Games." International
  Journal of Game Theory 14, no. 2 (June 1, 1985): 65--72.
  [[https://doi.org/10.1007/BF01769885]].
